Building Space-Efficient Inverted Indexes on Low-Cardinality Dimensions
Abstract
Many modern applications naturally lead to the use of inverted indexes for managing large collections of data items efficiently. Creating an inverted index on a low-cardinality data domain replicates data descriptors across lists, which increases storage overhead. For example, the use of RFID or similar sensing devices in supply chains produces massive tracking datasets that need effective spatial or spatio-temporal indexes. As the volume of data grows much faster than the number of spatial locations or time epochs, it is unavoidable that many of the resulting lists share large subsets of common items. In this paper we present techniques that exploit this characteristic of modern big-data applications to losslessly compress the resulting inverted indexes by discovering large common item sets and adapting the index so that it stores just one copy of each. We apply our method in the supply-chain domain using modern big-data tools and show that our techniques in many cases achieve compression ratios that exceed 50%.
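The idea of storing a shared item set once can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not describe; it is a simple greedy heuristic written under our own assumptions. The function names (`factor_common_items`, `expand`, `stored_entries`), the `min_shared` threshold, and the sample location keys are all hypothetical.

```python
from itertools import combinations


def factor_common_items(index, min_shared=3):
    """Greedy sketch (an assumption, not the paper's method).

    Repeatedly takes the largest pairwise intersection between posting
    lists, stores it once as a shared block, and replaces it with a
    block reference in every list that fully contains it.
    """
    if min_shared < 1:
        raise ValueError("min_shared must be at least 1")
    residual = {key: set(items) for key, items in index.items()}
    blocks = {}                         # block id -> items stored once
    refs = {key: [] for key in residual}  # key -> referenced block ids
    while True:
        best = None
        for a, b in combinations(sorted(residual), 2):
            common = residual[a] & residual[b]
            if len(common) >= min_shared and (best is None or len(common) > len(best)):
                best = common
        if best is None:
            break  # no remaining pair shares enough items
        block_id = len(blocks)
        blocks[block_id] = frozenset(best)
        for key in residual:
            if best <= residual[key]:
                residual[key] -= best
                refs[key].append(block_id)
    return residual, refs, blocks


def expand(key, residual, refs, blocks):
    """Rebuild the original posting list for one key (lossless)."""
    items = set(residual[key])
    for block_id in refs[key]:
        items |= blocks[block_id]
    return items


def stored_entries(residual, refs, blocks):
    """Count stored entries: residual items, block references, block items."""
    return (sum(len(v) for v in residual.values())
            + sum(len(v) for v in refs.values())
            + sum(len(v) for v in blocks.values()))


# Hypothetical tracking data: location -> item ids seen there.
index = {
    "dock_A": {1, 2, 3, 4, 5, 6},
    "dock_B": {1, 2, 3, 4, 5, 7},
    "dock_C": {1, 2, 3, 4, 8},
}
residual, refs, blocks = factor_common_items(index)
```

Calling `expand(key, residual, refs, blocks)` for each key returns the original list, so the factoring is lossless, and `stored_entries` gives a rough measure of how much replication was removed. The pairwise search is quadratic in the number of lists, which is tolerable only because a low-cardinality dimension has few lists; a production system would likely need a more scalable way to discover common item sets.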
Similar resources
Efficient Skycube Computation Using Bitmaps Derived from Indexes
TAMBARAM KAILASAM, GAYATHRI. Efficient Skycube Computation using Bitmaps derived from Indexes. (Under the direction of Dr. Jaewoo Kang.) Skyline queries have been increasingly used in multi-criteria decision making and data mining applications. They retrieve a set of interesting points from a potentially large set of data points. A point is said to be interesting if it is as good or better in a...
Skylight Design Regulation for Residential Building in Hamadan City
Skylights or light wells are an integral part of the design of low- and high-depth buildings. The design of these skylights in different areas is based on specific criteria. According to Hamadan's criteria, only the dimensions of a skylight and the ratio of its area to its height are needed to design it. The purpose of this study was to evaluate the accuracy of sk...
Efficient Phrase Querying with an Auxiliary Index
Search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this paper we consider how phrase queries can be efficiently supported with low disk overheads. Previous research has shown that phrase queries can be rapidly evaluated using nextword index...
A retrieval technique for high-dimensional data and partially specified queries
While the persistent data of many advanced database applications, such as OLAP and scientific studies, are characterized by very high dimensionality, typical queries posed on these data appeal to a small number of relevant dimensions. Unfortunately, the multi-dimensional access methods designed for high-dimensional data perform rather poorly for these partially specified queries. The retrieval ...